Efficient Partitioning Strategies for Distributed Web Crawling

Authors

  • José Exposto
  • Joaquim Macedo
  • António Pina
  • Albano Alves
  • José Rufino
Abstract

This paper presents a multi-objective approach to Web space partitioning, aimed at improving distributed crawling efficiency. The investigation is supported by the construction of two different weighted graphs. The first is used to model the topological communication infrastructure between crawlers and Web servers, and the second is used to represent the number of link connections between servers' pages. The values of the graph edges represent, respectively, computed RTTs and page links between nodes. The two graphs are further combined, using a multi-objective partitioning algorithm, to support Web space partitioning and load allocation for an adaptable number of geographically distributed crawlers. Partitioning strategies were evaluated by varying the number of partitions (crawlers) to obtain merit figures for: i) download time, ii) exchange time, and iii) relocation time. Evaluation showed that our partitioning schemes outperform traditional hostname-hash-based counterparts in all evaluated metrics, achieving on average an 18% reduction in download time, a 78% reduction in exchange time, and a 46% reduction in relocation time.
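The abstract describes blending two weighted graphs (one of measured RTTs, one of inter-server link counts) and partitioning the combined graph across crawlers. The sketch below is a toy illustration of that idea, not the paper's algorithm: it uses a hostname-hash baseline like the one the paper compares against, a simple alpha-blend of the two normalized edge-weight maps, and a greedy balanced placement as a stand-in for a real multi-objective partitioner (e.g. a METIS-style tool). All function names and the `alpha` parameter are assumptions for illustration.

```python
import hashlib
from collections import defaultdict

def hash_partition(hosts, k):
    # Baseline scheme: assign each host to one of k crawlers by hashing
    # its hostname (the traditional approach the paper evaluates against).
    return {h: int(hashlib.sha1(h.encode()).hexdigest(), 16) % k for h in hosts}

def combine_edges(link_edges, rtt_edges, alpha=0.5):
    # Merge the two weighted graphs into a single edge-affinity map.
    # link_edges / rtt_edges: {(u, v): weight}. Each map is normalized to
    # [0, 1] so the objectives are comparable, then blended with alpha.
    def normalize(edges):
        hi = max(edges.values()) or 1.0
        return {e: w / hi for e, w in edges.items()}

    links = normalize(link_edges)
    # Low RTT means high affinity, so invert the normalized RTT weights.
    rtts = {e: 1.0 - w for e, w in normalize(rtt_edges).items()}
    combined = defaultdict(float)
    for e, w in links.items():
        combined[e] += alpha * w
    for e, w in rtts.items():
        combined[e] += (1 - alpha) * w
    return dict(combined)

def greedy_partition(hosts, edges, k):
    # Greedy stand-in for the multi-objective partitioner: place each host
    # in the partition where its combined affinity to already-placed hosts
    # is strongest, while keeping partition sizes balanced (capacity cap).
    cap = -(-len(hosts) // k)  # ceiling division: max hosts per crawler
    assign, sizes = {}, [0] * k
    for h in hosts:
        score = [0.0] * k
        for (u, v), w in edges.items():
            if u == h and v in assign:
                score[assign[v]] += w
            elif v == h and u in assign:
                score[assign[u]] += w
        # Prefer the highest-affinity partition that still has room.
        for p in sorted(range(k), key=lambda p: -score[p]):
            if sizes[p] < cap:
                assign[h] = p
                sizes[p] += 1
                break
    return assign

hosts = ["a.example", "b.example", "c.example", "d.example"]
link_edges = {("a.example", "b.example"): 10,
              ("c.example", "d.example"): 8,
              ("a.example", "c.example"): 1}
rtt_edges = {("a.example", "b.example"): 20.0,
             ("c.example", "d.example"): 30.0,
             ("a.example", "c.example"): 200.0}

part = greedy_partition(hosts, combine_edges(link_edges, rtt_edges), 2)
```

With these toy weights, the tightly linked, low-RTT pairs (`a`/`b` and `c`/`d`) each end up on the same crawler, cutting only the weak `a`-`c` edge, whereas a hash-based assignment ignores both link structure and network proximity.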


Similar Papers

Index Partitioning Strategies for Peer-to-Peer Web Archival

The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. The goal is to build a scalable peer-to-peer framework for web archival and to further support time-travel search over it. We provide an initial design with crawling, persistent storage and indexing, and also analyze the partitioning strategies for historical analysis of data. Peer-to-peer (p2p) syste...


WebParF: A Web partitioning framework for Parallel Crawlers

With the ever-proliferating size and scale of the WWW [1], efficient ways of exploring content are of increasing importance. How can we efficiently retrieve information from it through crawling? In this "era of tera" and multi-core processors, we ought to consider multi-threaded processes as a serving solution. So, even better: how can we improve the crawling performance by using parallel cr...


A Study of Focused Web Crawlers for Semantic Web

Finding useful information on the Web, which has a large and distributed structure, requires efficient search strategies. Focused crawlers selectively retrieve Web documents that are relevant to a predefined set of topics. To make intelligent decisions about relevant URLs and web pages, different authors have proposed different strategies. In this paper we review and compare focused crawling s...


Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System

A large-scale distributed Web crawling system using voluntarily contributed personal computing resources allows small companies to build their own search engines at very low cost. The biggest challenge for such a system is how to implement functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such functionality is incremental...


A Scalable P2P RIA Crawling System with Partial Knowledge

Rich Internet Applications are widely used because they are interactive and user-friendly. Automated tools for crawling Rich Internet Applications are needed for many reasons, such as content indexing or testing for correctness and security. Due to the large size of RIAs, distributed crawling has been introduced to reduce the amount of time required for crawling. However, having one controlle...



Publication year: 2007